Abstract


This paper introduces the Wave Transactional Filesystem
(WTF), a novel, transactional, POSIX-compatible filesystem
based on a new file slicing API that enables efficient file
transformations. WTF provides transactional access to a
distributed filesystem, eliminating the possibility of
inconsistencies across multiple files. Further, the file
slicing API enables applications to construct files from the
contents of other files without having to rewrite or relocate
data. Combined, these enable a new class of high-performance
applications. Experiments show that WTF can qualitatively
outperform the industry-standard HDFS distributed filesystem,
up to a factor of four in a sorting benchmark, by reducing
I/O costs. Microbenchmarks indicate that the new features of
WTF impose only a modest overhead on top of the
POSIX-compatible API.

1 Introduction 


Distributed filesystems are a cornerstone of modern data
processing applications. Key-value stores such as Google’s
BigTable and Spanner, and Apache’s HBase use distributed
filesystems for their underlying storage. MapReduce uses a
distributed filesystem to store the inputs, outputs, and
intermediary processing steps for offline processing
applications. Infrastructure such as Amazon’s EBS and
Microsoft’s Blizzard [28] use distributed filesystems to
provide storage for virtual machines and cloud-oblivious
applications.

Yet, current distributed filesystems exhibit a tension
between retaining the familiar semantics of local filesystems
and achieving high performance in the distributed setting.
Often, designs will compromise consistency for performance,
require special hardware, or artificially restrict the
filesystem interface. For example, in GFS, operations can be
inconsistent or, “consistent, but undefined,” even in the
absence of failures. GFS-backed applications must account for
these anomalies, leading to additional work for application
programmers. HDFS side-steps this complexity by prohibiting
concurrent or non-sequential modifications to files. This
obviates the need to worry about nuances in filesystem
behavior, but fails to support use cases requiring
concurrency or random-access writes. Flat Datacenter Storage
[29] is only eventually consistent and requires a network
with full-bisection bandwidth, which can be cost prohibitive
and is not possible in all environments.

This paper introduces the Wave Transactional Filesystem
(WTF), a new distributed filesystem that contains a
transactional model with a new API that provides file slicing
operations. A WTF transaction may span multiple files and is
fully general; applications can include calls such as read,
write, and seek within their transaction. This file slicing
API enables applications to efficiently read, write, and
rearrange files without rewriting the underlying data. For
example, applications may concatenate multiple files without
reading them; garbage collect and compress a database without
writing the data; and even sort the contents of
record-oriented files without rewriting the files’ contents.

The key design decision that enables WTF’s advanced feature
set is an architecture that represents filesystem data and
metadata to ensure that filesystem-level transactions may be
performed using, solely, transactional operations on
metadata. Custom storage servers hold filesystem data and
handle the bulk of I/O requests. These servers retain no
information about the structure of the filesystem; instead,
they treat all data as opaque, immutable, variable-length
arrays of bytes, called slices. WTF stores references to
these slices in HyperDex alongside metadata that describes
how to combine the slices to reconstruct files’ contents.
This structure enables most bookkeeping to be done at the
metadata level, within the scope of HyperDex transactions.

Supporting this architecture is a custom concurrency control
layer that decouples WTF transactions from the underlying
HyperDex transactions. This layer ensures that applications
only abort when a concurrently-executing transaction changes
the filesystem in a way that generates an unresolvable,
application-visible conflict. This seemingly minor
functionality enables WTF to support many concurrent
operations with minimal abort-induced overheads.

Overall, this paper makes three contributions. First, it
describes a new API for filesystems called file slicing that
enables efficient file transformations. Second, it describes
an implementation of a transactional filesystem with minimal
overhead. Finally, it evaluates WTF and the file slicing
interfaces, and compares them to the non-transactional HDFS
filesystem.

2 Design 

WTF’s distributed architecture consists of four components:
the metadata storage, the storage servers, the replicated
coordinator, and the client library. Figure 1 summarizes this
architecture. The metadata storage builds on top of HyperDex
and its expansive API. The storage servers hold filesystem
data, and are provisioned for high I/O workloads. A
replicated coordinator service serves as a rendezvous point
for all components of the system, and maintains the list of
storage servers. The client library contains the majority of
the functionality of the system, and is where WTF combines
the metadata and data into a coherent filesystem.

In this section, we first explore the file slicing
abstraction to understand how the different components
contribute to the overall design. We will then look at the
design of the storage servers to understand how the system
stores the majority of the filesystem information. Finally,
we discuss performance optimizations and additional
functionality that make WTF practical, but are not essential
to the core design, such as replication, fault tolerance, and
garbage collection.

2.1 The File Slicing Abstraction 

WTF represents a file as a sequence of byte arrays that, when
overlaid, comprise the file’s contents. The central
abstraction is a slice, an immutable, byte-addressable,
arbitrarily sized sequence of bytes. A file in WTF, then, is
a sequence of slices and their associated offsets. This
representation has some inherent advantages over block-based
designs. Specifically, the abstraction provides a separation
between metadata and data that enables filesystem-level
transactions to be implemented using, solely, transactions
over the metadata. Data is stored in the slices, while the
metadata is a sequence of slices. WTF can transactionally
change these sequences to change the files they represent,
without having to rewrite the data.

Concretely, file metadata consists of a list of slice
pointers that indicate the exact location on the storage
servers of each slice. A slice pointer is a tuple consisting
of the unique identifier for the storage server holding the
slice, the local filename containing the slice on that
storage server, the offset of the slice within that local
file, and the length of the slice. Associated with each slice
pointer is an integer offset that indicates where the slice
should be overlaid when reconstructing the file. Crucially,
this representation is self-contained: everything necessary
to retrieve the slice from the storage server is present in
the slice pointer, with no need for extra bookkeeping
elsewhere in the system. As we will discuss later, the
metadata also contains standard information found in an
inode, such as modification time and file length.

Figure 1: WTF employs a distributed architecture consisting of
metadata storage, data storage, a replicated coordinator, and the
client library.

Figure 2: A 4 MB file with five writes that write or overwrite
different portions of the file. This figure shows the slices that
were written, the resulting file’s content, and the metadata in
HyperDex.

This slice pointer representation enables WTF to easily
generate new slice pointers that refer to subsequences of
existing slices. Because the representation transparently
reflects the global location of a slice on disk, WTF may use
simple arithmetic to create new slice pointers.
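
The arithmetic involved can be illustrated with a minimal sketch.
The class and field names below are our own illustrative choices
and are not taken from the WTF implementation; only the four
components of the tuple come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlicePointer:
    """Hypothetical field names; the text specifies only the four components."""
    server_id: int     # unique identifier of the storage server
    backing_file: str  # local filename on that server
    offset: int        # where the slice starts within the backing file
    length: int        # number of bytes in the slice

    def subslice(self, start: int, length: int) -> "SlicePointer":
        """Return a pointer to bytes [start, start + length) of this slice."""
        if start < 0 or length < 0 or start + length > self.length:
            raise ValueError("requested range falls outside the slice")
        # No data is touched: only the offset and length change.
        return SlicePointer(self.server_id, self.backing_file,
                            self.offset + start, length)


whole = SlicePointer(server_id=7, backing_file="backing.0", offset=4096, length=1000)
middle = whole.subslice(100, 200)
```

Here `middle` refers to the same server and backing file as
`whole`, shifted by 100 bytes, and no request to a storage server
is needed to create it.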

This representation also enables applications to modify a
file with only localized modifications to the metadata.
Figure 2 shows an example file consisting of five different
slices. Each slice is overlaid on top of previous slices.
Where slices overlap, the latest additions to the metadata
take precedence. For example, slice C takes precedence over
slices A and B; similarly, slice E completely obscures slice
D and part of C. The file, then, consists of the
corresponding slices of A, C, E, and B. The figure also shows
the compacted metadata for the same file. This compacted form
contains the minimal slice pointers necessary to reconstruct
the file without reading data that is overwritten by another
slice. Crucially, file modifications can be performed without
having to rearrange the entire metadata.
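
The overlay rule can be sketched as follows. This is a simplified
illustration, not WTF's code: the `Ptr` type, the `compact`
function, and the byte offsets chosen for slices A through E are
assumptions made for the example (the exact offsets in the figure
are not reproduced here).

```python
from typing import NamedTuple


class Ptr(NamedTuple):
    path: str    # backing file holding the slice (server id omitted for brevity)
    offset: int  # offset of the slice within the backing file
    length: int  # number of bytes


def sub(p, start, length):
    """Pointer to bytes [start, start + length) of slice p."""
    return Ptr(p.path, p.offset + start, length)


def compact(entries):
    """entries: (file_offset, Ptr) pairs in append order; later entries win."""
    visible = []  # non-overlapping pieces that are still visible
    for off, ptr in entries:
        end = off + ptr.length
        kept = []
        for o, p in visible:
            e = o + p.length
            if e <= off or o >= end:   # untouched by the new slice
                kept.append((o, p))
                continue
            if o < off:                # part to the left survives
                kept.append((o, sub(p, 0, off - o)))
            if e > end:                # part to the right survives
                kept.append((end, sub(p, end - o, e - end)))
        kept.append((off, ptr))
        visible = kept
    return sorted(visible, key=lambda entry: entry[0])


# Hypothetical offsets: C overlaps A and B; E hides D and part of C.
writes = [(0, Ptr("A", 0, 20)), (20, Ptr("B", 0, 20)), (10, Ptr("C", 0, 20)),
          (20, Ptr("D", 0, 5)), (15, Ptr("E", 0, 15))]
compacted = compact(writes)
```

With these offsets the compacted list names pieces of A, C, E,
and B, in that order, matching the precedence described above.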

The procedures for reading and writing follow directly from
the abstraction. A writer creates one or more slices on the
storage servers, and overlays them at the appropriate
positions within the file by appending their slice pointers
to the metadata list. Readers retrieve the metadata list,
compact it, and determine which slices must be retrieved from
the storage servers to fulfill the read.

The correctness of this design relies upon the metadata
storage providing primitives to atomically read and append to
the list. HyperDex natively supports both of these
operations. Because each writer writes slices before
appending to the metadata list, it is guaranteed that any
transaction that can see these immutable slices is serialized
after the writing transaction commits. It can then retrieve
the slices directly. The transactional guarantees of WTF
extend directly from this design as well: a WTF transaction
will execute a single HyperDex transaction consisting of
multiple append and retrieve operations.

2.2 Storage Server Interface 

The file slicing abstraction greatly simplifies the design of
the storage servers. Storage servers deal exclusively with
slices, and are oblivious to files, offsets, or concurrent
writes. Instead, the complete storage server API consists of
just two calls that create and retrieve slices.

A storage server processes a request to create a slice by
writing the data to disk and returning a slice pointer to the
caller. The structure of this request intentionally grants
the storage server complete flexibility to store the slice
anywhere it chooses because the slice pointer containing the
slice’s location is returned to the client only after the
slice is written to disk. A storage server can retrieve
slices by following the information in the slice pointer to
open the named file, read the requisite number of bytes, and
return them to the caller.

The transparency of the slice pointer minimizes the
bookkeeping of the storage server implementation, while also
permitting a wide variety of implementation strategies.
Currently, each WTF storage server maintains a directory of
slice-containing backing files and information about its own
identity in the system. Each backing file is written
sequentially as the storage server creates new slices.

Region 1 Metadata: [A, C]
Region 2 Metadata: [B, C, D, E]

Figure 3: A file in WTF that is partitioned into 2 MB regions.
Writes within each region are appended solely to that region’s
metadata. Writes that cross regions, like C, are atomically
applied to both lists.

As an optimization, the storage servers maintain multiple
backing files to which slices are written. This serves three
purposes: first, it allows servers to avoid contention when
writing to the same file; second, it allows the storage
server to explicitly spread data across multiple filesystems
if configured to do so; and, finally, it allows the storage
server to use hints provided by writers to improve locality
within backing files, as described in Section 2.7.

2.3 File Partitioning 

Practically, it is desirable to keep the list of slice
pointers small so that they can be stored, retrieved, and
transmitted with low overhead; however, it would be
impractical to achieve this by limiting the number of writes
to a file. In order to achieve support for both arbitrarily
large files and efficient operations on the list of slice
pointers, WTF partitions a file into fixed size regions, each
with its own list. Each region is stored as its own object in
HyperDex under a deterministically derived key.

Operations on these partitioned metadata lists behave the
same as operations on a single list. When operations span
multiple regions, they are separated into their respective
operations on each region, and performed within the context
of a single multi-key HyperDex transaction. This guarantees
that multiple regions may be modified simultaneously in one
atomic action. Figure 3 shows a series of writes that span
different metadata regions, and their resulting metadata
lists.
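
As a rough illustration of how a write might be separated into
per-region operations, consider the sketch below. The function
name and the shape of its return value are assumptions for this
example; the 2 MB region size follows Figure 3.

```python
MB = 1 << 20
REGION_SIZE = 2 * MB  # 2 MB regions, as in Figure 3


def split_by_region(file_offset, length, region_size=REGION_SIZE):
    """Split a write into per-region pieces.

    Returns a list of tuples:
    (region_index, offset_within_region, offset_within_slice, piece_length).
    """
    pieces = []
    pos = file_offset
    end = file_offset + length
    while pos < end:
        region = pos // region_size
        region_end = (region + 1) * region_size
        n = min(end, region_end) - pos  # bytes of this write inside the region
        pieces.append((region, pos - region * region_size, pos - file_offset, n))
        pos += n
    return pieces


# A 2 MB write starting at offset 1 MB crosses the first region boundary,
# so it produces one piece for each of the two regions it touches.
crossing = split_by_region(1 * MB, 2 * MB)
```

Each piece would then be appended to its region's list, with all
appends issued inside one multi-key transaction so that the write
becomes visible atomically.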

2.4 Filesystem Hierarchy 

The WTF filesystem hierarchy is modeled after the traditional
Unix filesystem, with directories and files. Each directory
contains entries that are named links to other directories or
files, and WTF enables files to be hard linked to multiple
places in the filesystem hierarchy.

API                          Description
yank(fd, sz): slice, [data]  Copy sz bytes from fd; return slice pointers and optionally the data
paste(fd, slice)             Write slice to fd and increment the offset
punch(fd, amount)            Zero-out amount bytes at the fd offset, freeing the underlying storage
append(fd, slice)            Append slice to the end of file fd
concat(sources, dest)        Concatenate the listed files to create dest
copy(source, dest)           Copy source to dest using only the metadata

Table 1: WTF’s new file slicing API. Note that these supplement
the POSIX API, which includes calls for moving a file descriptor’s
offset via seek. concat and copy are provided for convenience and
may be implemented with yank and paste.

WTF implements a few changes to the traditional filesystem
behavior to reduce false dependencies when opening a file. If
one were to implement path traversal as it is traditionally
implemented, an open operation would require a traversal from
the root, putting every directory along the path within the
scope of a transaction, and require several round trips to
both HyperDex and the storage servers to open a file.

WTF avoids traversing the filesystem on open by maintaining a
pathname to inode mapping. This enables a client to map a
pathname to the corresponding inode with just one HyperDex
lookup, no matter how deeply nested the pathname. To enable
applications to enumerate the contents of a single directory,
WTF maintains traditional-style directories, implemented as
special files, alongside the one-lookup mapping. The two data
structures are atomically updated using HyperDex
transactions. This optimization simplifies the process of
opening files, without a loss of functionality.

Inodes are also stored in HyperDex, and contain standard
information, such as link count and modification time. The
inode also maintains ownership, group, and permissions
information, though WTF differs from POSIX in that
permissions are not checked on the full pathname from the
root. Each inode also stores a reference to the highest-offset
region written within the file, enabling applications to find
the end of the file.

Because HyperDex permits transactions to span multiple keys
across independent schemas, updates to the filesystem
hierarchy remain consistent. For example, to create a
hardlink for a file, WTF atomically creates a new pathname to
inode mapping for the file, increments the inode’s link
count, and inserts the pathname and inode pair into the
destination directory, which requires a write to the file
holding the directory entries.

2.5 File Slicing Interface 

The file slicing interface enables new applications to make
more efficient use of the filesystem. Instead of operating on
bytes and offsets as traditional POSIX systems do, this new
API allows applications to manipulate subsequences of files
at the structural level, without copying or reading the data
itself.

Table 1 summarizes the new APIs that WTF provides to
applications. The yank, paste, and append calls are analogous
to read, write, and append, but operate on slices instead of
sequences of bytes. The yank call retrieves slice pointers
for a range of the file. An application may provide these
slice pointers to a subsequent call to paste or append to
write the data back to the filesystem, reusing the existing
slices. These write operations bypass the storage servers and
only incur costs at the metadata storage component.

The append call is internally optimized to improve
throughput. A naive append call could be implemented as a
transaction that seeks to the end of the file, and performs a
paste. While not incorrect, it would allow only one append
call to proceed at a time, because only one append can commit
for each value for the end of file; the others will
spuriously fail and retry. Instead, WTF stores, alongside the
metadata list, an offset representing the end of the region.
An append call will conditionally append to the list, making
sure that the offset, plus the length of the slice to be
appended, does not exceed the bounds of the metadata region.
The entry in the metadata list for an append is marked as
relative to the end of the file, rather than a specific
offset. When an append is too large to fit within a single
region, WTF will fall back on reading the offset of the end
of file, and performing a write at that offset. This enables
multiple append operations to proceed in parallel in the
common case.
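
The conditional append can be sketched with a toy, single-process
model of one metadata region. The `Region` class and its method
names are hypothetical, and the model only tracks append entries;
in WTF the check and the append would be a single atomic
operation at the metadata store rather than two Python
statements.

```python
class Region:
    """Toy model of one metadata region that receives only appends."""

    def __init__(self, capacity):
        self.capacity = capacity  # size of the region in bytes
        self.end = 0              # end offset stored alongside the list
        self.entries = []         # append entries carry no absolute offset

    def conditional_append(self, ptr, length):
        """Append only if the slice still fits inside the region."""
        if self.end + length > self.capacity:
            return False          # caller falls back to an explicit write at EOF
        self.entries.append((ptr, length))
        self.end += length
        return True

    def resolve(self):
        """Turn end-relative entries into absolute offsets at read time."""
        resolved, pos = [], 0
        for ptr, length in self.entries:
            resolved.append((pos, ptr, length))
            pos += length
        return resolved


region = Region(capacity=100)
ok_a = region.conditional_append("slice-a", 30)
ok_b = region.conditional_append("slice-b", 40)
too_big = region.conditional_append("slice-c", 50)  # 70 + 50 exceeds 100
```

Because an entry's final position is determined by its place in
the list rather than by an offset the writer observed, two
appenders never need to agree on the end of file in advance.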

Other calls that are new to the file slicing API have no
counterpart in traditional APIs. The concat call concatenates
multiple files to create one unified output file. The copy
call creates a copy of a file by copying the file’s compacted
metadata. Both of these calls may be implemented by yank and
paste and are provided for convenience.
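
To make the relationship concrete, the sketch below builds concat
from yank and paste over an in-memory model of compacted
metadata. The function signatures differ from Table 1 (they take
metadata lists rather than file descriptors), and the model
assumes files without holes; both are simplifications for the
example.

```python
from typing import NamedTuple


class Ptr(NamedTuple):
    server: int
    path: str
    offset: int
    length: int


def yank(file, off, sz):
    """Return slice pointers covering bytes [off, off + sz) of a file.

    file: compacted list of (file_offset, Ptr), sorted and non-overlapping.
    """
    out = []
    for o, p in file:
        lo, hi = max(o, off), min(o + p.length, off + sz)
        if lo < hi:  # this entry intersects the requested range
            out.append(p._replace(offset=p.offset + (lo - o), length=hi - lo))
    return out


def paste(file, off, ptrs):
    """Overlay ptrs back to back starting at off; return the new offset."""
    for p in ptrs:
        file.append((off, p))
        off += p.length
    return off


def concat(sources, dest):
    """Concatenate source files into dest using only metadata operations."""
    off = 0
    for src in sources:
        size = max((o + p.length for o, p in src), default=0)
        off = paste(dest, off, yank(src, 0, size))


a = [(0, Ptr(1, "f", 0, 10))]
b = [(0, Ptr(2, "g", 100, 5)), (5, Ptr(2, "g", 200, 5))]
dest = []
concat([a, b], dest)
```

No slice data is read or written at any point; `dest` simply
ends up referencing the slices that `a` and `b` already
reference.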

2.6 Transaction Retry 

To ensure that transactions abort only when they encounter
application-visible conflicts, WTF implements its own
concurrency control on top of HyperDex that retries aborted
transactions. To see why this may be necessary, consider an
application that seeks to the end of a file, and writes the
string “Hello World” within a single transaction. Barring any
permanent failures, such a transaction should always succeed
because this transaction can serialize between any other pair
of transactions as it does not impose any requirements on the
filesystem state. If, however, a write were to change the
length of the file between the end-of-file lookup and the
transaction commit, the transaction encompassing the original
seek-and-write operation will abort within HyperDex because
the observed value of the file length has changed. Passing
this failure up to the application, which never saw the
offset of the end of file, would complicate the guarantees
made by the WTF interface. Instead, WTF internally retries
the transaction by repeating the seek and then pasting the
previously written slice that contains “Hello World” at the
new end of file. This ensures that transactions only abort in
response to an unresolvable, application-visible conflict.

The mechanism that retries transactions is a thin layer that
sits at the boundary of the WTF client library and the user’s
application. Each call the application makes is logged, along
with the arguments provided to the call, and its return
value. If the transaction aborts within HyperDex, the state
of the system remains unchanged by the WTF transaction, so it
is safe to retry it in its entirety. WTF will then replay all
of the user’s operations in sequence using the same arguments
originally supplied. If at any point a re-executed call
completes with an outcome different from the original
execution, the transaction will signal an abort to the
application. Similarly, if the WTF transaction re-executes
all operations successfully, and the HyperDex transaction
commits, the commit status is passed back to the application.
WTF will retry transactions as necessary to ensure that they
only abort when operations on the filesystem generate
unresolvable, application-visible conflicts.
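
The replay logic can be approximated as follows. This is a
sketch under several assumptions: the operations are supplied up
front as a list (the real layer logs calls as the application
makes them), `max_attempts` is a bound we introduce for the
example, and `FakeTxn` is a toy single-file stand-in with a
built-in "racer" that simulates a concurrent writer. None of
these names come from WTF or HyperDex.

```python
class Conflict(Exception):
    """Raised by commit() when the underlying metadata transaction aborts."""


class Abort(Exception):
    """Surfaced to the application: a replayed call changed its outcome."""


def run_with_retry(ops, begin, max_attempts=10):
    """ops: list of (method_name, args); begin() returns a fresh transaction."""
    log = None  # return values observed during the first execution
    for _ in range(max_attempts):
        txn = begin()
        results = []
        for i, (name, args) in enumerate(ops):
            out = getattr(txn, name)(*args)
            if log is not None and out != log[i]:
                raise Abort(f"{name} returned a different result on replay")
            results.append(out)
        if log is None:
            log = results
        try:
            txn.commit()
            return results
        except Conflict:
            continue  # nothing was applied, so replaying is safe
    raise Abort("gave up after repeated conflicts")


class FakeTxn:
    """Toy transaction over one in-memory file (illustration only)."""

    def __init__(self, fs):
        self.fs, self.seen_len, self.pos, self.pending = fs, None, 0, []

    def seek_end(self):
        self.seen_len = self.pos = len(self.fs["data"])
        return None  # the offset itself is not revealed to the application

    def read_all(self):
        self.seen_len = len(self.fs["data"])
        return self.fs["data"]

    def write(self, data):
        self.pending.append((self.pos, data))
        self.pos += len(data)
        return len(data)

    def commit(self):
        if self.fs["racers"]:  # simulate a concurrent writer committing first
            self.fs["data"] += self.fs["racers"].pop()
        if self.seen_len is not None and self.seen_len != len(self.fs["data"]):
            raise Conflict()
        for pos, data in self.pending:
            buf = self.fs["data"]
            self.fs["data"] = buf[:pos] + data + buf[pos + len(data):]


fs = {"data": b"abc", "racers": [b"XYZ"]}
results = run_with_retry([("seek_end", ()), ("write", (b"Hello World",))],
                         lambda: FakeTxn(fs))
```

In this run the first attempt conflicts with the racer, the
second attempt replays the seek and the write, and the
application sees a single successful transaction. A transaction
that had read the file's contents would instead abort, because
the replayed read returns different bytes.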

To reduce the overhead for maintaining the log of individual
operations, the client library uses slice pointers to refer
to bytes of data that pass through the interface. For
example, a write of 100 MB will not be copied and maintained
in the log; instead, the log maintains the slice pointers
that refer to the 100 MB on the storage servers. Similarly,
reads are maintained using the retrieved slice pointers, and
not the data itself or checksums thereof.

2.7 Locality-Aware Slice Placement 

As an optimization, WTF employs a locality-aware slice
placement algorithm to improve the locality on disk of writes
to nearby ranges of a file. Writes to the same metadata
region reside on the same servers, and are located near each
other on those servers’ disks. Files that are written to WTF
sequentially will, with high probability, be written
sequentially to disk.

WTF chooses which server to write a slice to using consistent
hashing across the servers to ensure that writes to the same
region reside on the same storage server. The writer provides
the slice and the identity of the metadata region the write
affects to the storage server, which then uses consistent
hashing to map each slice to a file on its local disk. The
hashing function used at the storage server level is
different from the hashing function used across storage
servers, so writes which map to the same server will be
unlikely to map to the same backing file, unless they are for
the same metadata region.

Overall, this ensures that a writer that writes sequentially
to a file will write contiguous sequences of bytes on the
storage servers with high probability. During compaction,
these independent slices may be combined into a single slice
spanning the maximum contiguous range on the disk. For
example, a sequential writer writing fixed size 1 MB blocks
to a metadata region will sequentially send each of these
blocks to the same storage server, which will append them to
the same file on disk. These adjacent slices may be compactly
represented by a single slice pointer that references the
contiguous region.
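
A two-level placement of this kind might look like the sketch
below. The ring construction (SHA-256 with virtual nodes), the
salts, and the server and file names are all assumptions for
illustration; the text specifies only that consistent hashing is
used at both levels with different hash functions. For brevity
the sketch shares one file ring across servers, whereas each
storage server would maintain its own.

```python
import bisect
import hashlib


def _hash(salt, key):
    """Salted 64-bit hash; a different salt gives an independent function."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, salt, vnodes=64):
        self.salt = salt
        self.points = sorted((_hash(salt, f"{n}#{i}"), n)
                             for n in nodes for i in range(vnodes))
        self.keys = [h for h, _ in self.points]

    def lookup(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        i = bisect.bisect(self.keys, _hash(self.salt, key)) % len(self.points)
        return self.points[i][1]


servers = Ring([f"server-{i}" for i in range(12)], salt="across-servers")
files = Ring([f"backing.{i}" for i in range(8)], salt="within-server")


def place(region_id):
    """Map a metadata region to (storage server, backing file)."""
    return servers.lookup(region_id), files.lookup(region_id)


first = place("inode42:region7")
second = place("inode42:region7")
```

Every write tagged with the same region identifier lands on the
same server and the same backing file, which is what allows
sequential writes to a region to become adjacent on disk.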

2.8 Garbage Collection 

WTF prevents unbounded growth of data and metadata 
through a three-tiered garbage collection mechanism. 

First, the most prevalent form of garbage in WTF comes from
metadata lists that grow as many independent append
operations are applied to them. This predominant case is
easily handled by compacting the metadata list, and storing
the compacted list in place of the original list. This
eliminates the garbage generated from overlaid slices, such
as those in Figure 2, and will typically combine multiple
slices into one because of locality-aware slice placement.
WTF retrieves the current metadata list, compacts it, and
stores the result using a single HyperDex transaction. The
resulting file contents are equivalent to those from before
the compaction, and the compaction incurs no I/O on the
storage servers.

Metadata compaction is not always sufficient. In particular,
random writes reduce the effect that locality-aware slice
placement has on compaction, leading to fragmented metadata
lists. In this case, WTF writes a new slice with contents
identical to the compacted form of the current metadata list,
and swaps a pointer to this slice with the originally
observed list.

Finally, as an application overwrites or deletes files,
slices become unused by the filesystem and turn into garbage
on the storage servers. Because the storage servers outsource
all bookkeeping to the metadata storage, storage servers do
not directly know which portions of their local data are
garbage. WTF periodically scans the entire filesystem
metadata and constructs a list of in-use slice pointers for
each storage server. For simplicity of implementation, these
lists are stored in a reserved directory within the WTF
filesystem so that they need not be maintained in memory or
communicated out of band to the storage servers. Storage
servers link the WTF client library and read their respective
files to discover unused regions in their local storage
space. To prevent the race condition where a slice is created
and garbage collected before being referenced by the
metadata, the periodic garbage collection is run infrequently
(on the order of hours or days), and servers do not collect
an unused region until it appears in two consecutive scans.

Storage servers implement garbage collection by creating
sparse files on the local disk. To compress a file containing
garbage slices, a storage server rewrites the file, seeking
past each unused slice. On inode-based Linux filesystems this
creates a sparse file that occupies disk space proportional
to the in-use slices it contains. Counter-intuitively, files
with the most garbage are the most efficient to collect,
because the garbage collection thread seeks past large
regions of garbage and only writes the small number of
remaining slices. Backing files with little garbage incur
much more I/O, because there are more in-use slices to
rewrite. WTF chooses the file with the most garbage to
compact first, because it will simultaneously compact the
most garbage and incur the least I/O.
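
The rewrite step can be sketched in a few lines. The function
name and the extent representation are assumptions for this
example; whether the holes actually consume no disk space depends
on the local filesystem, so the sketch only checks the logical
contents.

```python
import os
import tempfile


def rewrite_sparse(src_path, dst_path, in_use):
    """Copy only the in-use (offset, length) extents, seeking past garbage.

    Offsets are preserved so that existing slice pointers remain valid.
    """
    size = os.path.getsize(src_path)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for offset, length in sorted(in_use):
            src.seek(offset)
            dst.seek(offset)   # skipping ahead leaves a hole where supported
            dst.write(src.read(length))
        dst.truncate(size)     # keep the original length if the tail is garbage


with tempfile.TemporaryDirectory() as tmp:
    old = os.path.join(tmp, "backing.old")
    new = os.path.join(tmp, "backing.new")
    with open(old, "wb") as f:
        f.write(b"A" * 30 + b"G" * 40 + b"B" * 30)  # "G" marks garbage bytes
    rewrite_sparse(old, new, in_use=[(0, 30), (70, 30)])
    with open(new, "rb") as f:
        rewritten = f.read()
```

The work done is proportional to the in-use bytes that are
copied, which is why a file that is mostly garbage is the
cheapest to collect.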

The storage servers derive benefit from the kernel buffer
cache by relying upon writing to a local filesystem rather
than direct disk access. When writing a file, Linux will not
start to flush the data to disk immediately, but will instead
flush data in batched writes. The filesystem coalesces many
writes and reduces the number of seeks used by garbage
collection.

2.9 Fault Tolerance 

WTF uses replication to add a configurable degree of fault
tolerance to the system. To accomplish this, it augments the
metadata list such that each entry references multiple slice
pointers that are replicas of the data. On the write path,
writers create multiple replica slices and append their
pointers atomically. Readers may read from any of the
replicas, as they hold identical data.

The metadata storage derives its fault tolerance from the
strong guarantees offered by HyperDex. Specifically, HyperDex
guarantees that it can tolerate f failures for a
user-configurable value of f. HyperDex uses value-dependent
chaining to coordinate between the replicas and manage
recovery from failures.

The data storage derives its durability guarantees from 
the backing file system. While replication protects WTF 
against uncorrelated failures, WTF is not designed to 
withstand correlated failures such as cluster-wide power 
outages. 


The file slicing abstraction is easier to make fault tolerant
and consistent than existing block-based solutions. In a
block-based design a write is often constrained to reuse
existing replicas for the block it is writing. Further, block
designs often employ some mechanism on top of the block
servers to consistently update all replicas, or at least
ensure they eventually converge to the same value. This added
mechanism introduces overheads that are absent in WTF’s
slice-based design.

3 Implementation 

Everything described in this paper is available in our 
WTF implementation. Currently, the implementation is 
approximately 30 k lines of code written exclusively for 
WTF. It relies upon HyperDex with transactions, which 
is approximately 85 k lines of code, with an additional 
37 k lines of code of supporting libraries written for both 
projects. The replicated coordinator for both HyperDex 
and WTF is an additional 19 k lines of code. Altogether, 
WTF constitutes 171k lines of code that were written for 
WTF and HyperDex. 

WTF’s fault tolerant coordinator is implemented as a
replicated object on top of the Replicant replicated state
machine service. The coordinator consists of just 960 lines
of code that are compiled into a dynamically linked library
that is passed to Replicant. Replicant deploys multiple
copies of the library, and uses Paxos to sequence the
function calls into the library.

4 Evaluation 

To evaluate WTF, we will look at a series of both end-to-end
benchmarks and microbenchmarks that demonstrate our working
implementation under a variety of conditions. The first part
of this section looks at how the file slicing interface
improves an end-to-end sorting benchmark written in the style
of a MapReduce application. We will then look at a series of
microbenchmarks that characterize the performance of WTF’s
conventional filesystem interface.

All benchmarks execute on a cluster of fifteen servers
dedicated to the experiment. Each server is equipped with two
Intel Xeon 2.5 GHz L5420 processors, 16 GB of DDR2 memory
with error correction, and between 500 GB and 1 TB SATA
spinning-disks from the same era as the CPUs. The servers are
connected with gigabit ethernet via a single top of rack
switch. Installed on each server is 64-bit Ubuntu 14.04, HDFS
from Apache Hadoop 2.7, and WTF with HyperDex 1.8.1.

For all benchmarks, HDFS and WTF are configured to provide an
apples-to-apples comparison. Both systems are deployed with
three nodes reserved for the metadata (the HDFS name node, or
the HyperDex cluster) and the remaining twelve servers are
allocated as storage nodes for the data. Except for changes
necessary to achieve feature parity, both systems were
deployed in their default configuration. To bring the
semantics of HDFS up to par with WTF, each write is followed
by an hflush call to ensure that the write is flushed from
the client-side buffer and is visible to readers. The hflush
primitive solely makes sure that writes are visible to all
readers, and does not trigger an fsync on the written data;
the resulting guarantee is the same guarantee provided by a
WTF write, and no stronger.

Stage        Conventional              File Slicing
Bucketing    R = 100 GB, W = 100 GB    R = 100 GB, W = 0 GB
Sorting      R = 100 GB, W = 100 GB    R = 100 GB, W = 0 GB
Merging      R = 100 GB, W = 100 GB    R = 0 GB,   W = 0 GB
Total        R = 300 GB, W = 300 GB    R = 200 GB, W = 0 GB

Table 2: File slicing enables the WTF-based sort application to
sort a 100 GB file with one third the I/O required by
conventional distributed filesystems.

Additionally, in order to work around a long-standing bug with
append operations, the HDFS block size was reduced from 128 MB to
64 MB. Without this change to the configuration, the HDFS node can
report an out-of-disk-space condition when only 3% of the disk
space is in use. Instead of gracefully handling the condition and
falling back to other replicas as is done in WTF, the failure
cascades and causes multiple writes to fail, making it impossible
to complete the benchmark. Decreasing the block size does increase
the amount of metadata held on the name node, but because all of
this metadata is held within main memory, and our workloads do not
generate more metadata than the HDFS name node's memory capacity,
the increase is irrelevant to our benchmarks. The change is
unlikely to impact the performance of data nodes because the
increase from 64 MB to 128 MB was not motivated by performance.
WTF is also configured to use 64 MB regions.
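To put the metadata increase in perspective, the following sketch counts the blocks needed for a single 100 GB file at each block size. It assumes binary units (1 GB = 1024 MB) and counts blocks per replica; the helper name blocks_needed is illustrative.

```python
MB = 1024 * 1024
GB = 1024 * MB


def blocks_needed(file_bytes, block_bytes):
    """Number of fixed-size blocks required to hold file_bytes."""
    return -(-file_bytes // block_bytes)  # ceiling division


blocks_at_64 = blocks_needed(100 * GB, 64 * MB)
blocks_at_128 = blocks_needed(100 * GB, 128 * MB)
print(blocks_at_64, blocks_at_128)  # 1600 800
```

Halving the block size doubles the number of block entries, which is small relative to what a name node can hold in memory for a workload of this size.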

Except where otherwise noted, both systems replicate 
all files such that two copies of the file exist. This allows 
the filesystem to tolerate the failure of any one storage 
server throughout the experiment without loss of data or 
availability. It is possible to tolerate more failures so long 
as all the replicas for a file do not fail simultaneously. 

4.1 Map Reduce: Sorting 

MapReduce is a processing technique that forms the basis of many
modern analytic applications. Because filesystems like HDFS and
GFS are the basis of modern MapReduce frameworks, MapReduce
applications provide a useful means of evaluating new distributed
filesystems.

Figure 4: Total execution time for sorting a 100 GB file using a
MapReduce application. HDFS takes more than one hour and seven
minutes to sort the file, while WTF completes the same task in
under fifteen minutes.

Sorting a file with MapReduce is a three-step process that breaks
the sort into two map jobs followed by a reduce job. The first map
task partitions the input file into buckets, each of which holds a
disjoint, contiguous section of the keyspace. These buckets are
sorted in parallel by the second map task. Finally, the reduce
phase concatenates the sorted buckets to produce the sorted output.
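The three stages described above can be sketched in a few lines. This is an in-memory illustration only, not the benchmark application; the function names, the bucket boundaries, and the record layout are assumptions made for the example.

```python
import bisect
import random


def bucketize(records, boundaries):
    """First map task: place each record in the bucket for its key range."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        buckets[bisect.bisect_right(boundaries, key)].append((key, value))
    return buckets


def sort_buckets(buckets):
    """Second map task: sort each bucket independently (parallelizable)."""
    return [sorted(b, key=lambda rec: rec[0]) for b in buckets]


def concatenate(buckets):
    """Reduce phase: buckets cover disjoint, ordered key ranges, so joining
    them in order yields the fully sorted output."""
    return [rec for b in buckets for rec in b]


rng = random.Random(42)
records = [(rng.randrange(1000), f"payload-{i}") for i in range(200)]
result = concatenate(sort_buckets(bucketize(records, [250, 500, 750])))
```

Because the buckets partition the keyspace into contiguous ranges, the final step needs no comparisons at all, which is why it maps naturally onto a concatenation primitive.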

Each intermediate step of this application is written to disk,
implying that the entire data set will be read or written several
times over. Here, WTF's file slicing interface can reduce this
excessive I/O and improve the efficiency of the application.
Instead of reading and writing whole records during the first two
stages, WTF can use yank and paste to rearrange the records. File
slicing also eliminates almost all I/O of the reduce phase using a
concat operation. Table 2 summarizes the number of bytes of data we
can expect to be read or written while sorting a 100 GB file. We
can see that a conventional API will perform 600 GB of total I/O
while a file-slicing filesystem can do the same task with only
200 GB of I/O.
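The totals quoted above follow directly from the per-stage figures in Table 2. The sketch below reproduces that arithmetic; the dictionary layout and the name total_io are illustrative.

```python
FILE_GB = 100
STAGES = ("bucketing", "sorting", "merging")

# (reads, writes) per stage, expressed as multiples of the file size (Table 2).
conventional = {"bucketing": (1, 1), "sorting": (1, 1), "merging": (1, 1)}
file_slicing = {"bucketing": (1, 0), "sorting": (1, 0), "merging": (0, 0)}


def total_io(plan, size_gb):
    """Return (GB read, GB written, GB total) summed over all stages."""
    reads = sum(plan[stage][0] for stage in STAGES) * size_gb
    writes = sum(plan[stage][1] for stage in STAGES) * size_gb
    return reads, writes, reads + writes


conventional_total = total_io(conventional, FILE_GB)
slicing_total = total_io(file_slicing, FILE_GB)
print(conventional_total, slicing_total)  # (300, 300, 600) (200, 0, 200)
```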

Empirically, the file slicing operations do improve the running
time of a WTF-based sort. Figure 4 shows the total running time of
both systems to sort a 100 GB file consisting of 500 kB records
indexed by 10 B keys that were generated uniformly at random. In
this benchmark, the intermediate files are written without
replication because they may easily be recomputed from the input.
We can see that WTF sorts the entire file in one fourth the time
taken to perform the same task on HDFS.

The speedup is attributable to the efficient primitives that WTF
exposes to applications. From Figure 5 we can see that the
WTF-based sorting application spends less time in the partitioning
and merging steps than the conventional HDFS-based application.
For HDFS, the majority of execution time is spent in merging and
bucketing of the input data. Just 8.5% of execution time is spent
in the CPU-intensive sorting task. The rest is spent shuffling data
on either side of this task. In contrast, WTF spends 74.1% of its
time in the CPU-intensive task, whereas the first map phase
accounts for 25.3% of the execution time. The concatenation
operation at the end occupies less than 1% of the overall running
time. From this, we can conclude that the efficiency of WTF's I/O
operations contributes to reducing the overall runtime of the sort
operation.

Figure 5: Execution time of the sort broken down by stage of the
MapReduce application. HDFS spends 91.5% of its time partitioning
and reassembling the data, compared to WTF, which spends 25.9% of
its time on the same task.

Overall, this sorting benchmark shows that file slicing operations
can improve MapReduce performance. In general, applications that
process data by partitioning, shuffling, or combining records will
benefit from a reduction in I/O and a decrease in running time.

4.2 Micro Benchmarks 

In this section we examine a series of microbenchmarks that
quantify the performance of the POSIX API for both HDFS and WTF.
Here HDFS serves as a gold standard. With ten years of active
development, and deployment across hundreds of nodes, including
large deployments at both Facebook and LinkedIn, HDFS provides a
reasonable estimate of distributed filesystem performance.
Although we cannot expect WTF to grossly outperform HDFS (both
systems are limited by the speed of the hard disks in our cluster),
we can use the degree to which WTF and HDFS differ in performance
to estimate the overheads present in WTF's design.

Setup. The workload for these benchmarks is generated by twelve
distinct clients, one per storage server in the cluster, that all
work in parallel. This configuration was chosen after
experimentation because additional clients do not significantly
increase the throughput, but do increase the latency significantly.

All benchmarks operate on 100 GB of data, or over 16 GB per machine
once replication is accounted for. This workload is small enough
that we can run the experiments several times each, but is big
enough to be blocked by disk on modern Linux kernels. The Linux
virtual memory subsystem will not allow a writing process to
populate the entirety of RAM with dirty buffers; instead, only a
fraction of memory may be used for dirty pages before the kernel
forces writing processes to yield time for writing back I/O [24].
Consequently, although our test data is not multiple times the
memory available in our cluster, it is more than five times the
space available for storing dirty buffers. To mitigate any
confounding effects of the kernel's buffer cache on read-oriented
experiments, the buffer cache was completely cleared before each
such experiment.

Figure 6: Performance of a one-server deployment of HDFS and WTF
compared with the ext4 filesystem. Error bars indicate the standard
error of the mean across seven trials.

Single server performance. This first benchmark executes on a
single server to establish the baseline performance of a one-node
cluster. Here, we'll not only compare the two systems to each
other, but also to the same workload implemented on a local ext4
filesystem. Our expectation here is that the POSIX API will provide
an upper bound on performance. To reduce the extent to which round
trip time dominates the calls in each distributed system, the
client and storage server are colocated. Figure 6 shows the
throughput of write and read operations in the one-node cluster.
From this we can see that the maximum measured throughput of a
single node in our cluster is 87 MB/s, which means the total
throughput of the cluster, assuming optimal usage, will peak at
1044 MB/s.
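The cluster-wide figures quoted in the setup and in this benchmark are simple products of the per-node numbers. The sketch below reproduces them; the variable names are illustrative, and the values are those stated in the text (twelve storage nodes, two replicas, 100 GB of data, 87 MB/s per node).

```python
storage_nodes = 12
replicas = 2
data_gb = 100
per_node_mb_per_s = 87

# Upper bound on aggregate disk throughput if every node runs at its maximum.
peak_cluster_mb_per_s = per_node_mb_per_s * storage_nodes

# Data stored on each machine once both replicas are counted.
per_machine_gb = data_gb * replicas / storage_nodes

print(peak_cluster_mb_per_s)     # 1044
print(round(per_machine_gb, 2))  # 16.67
```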

Sequential Writes. WTF guarantees that all readers in the
filesystem see a write upon its completion; however, this guarantee
is only useful to applications when throughput remains high for
smaller writes. This benchmark examines the impact that write size
has on the aggregate throughput achievable for filesystem-based
applications by varying the block size and measuring the aggregate
throughput across all twelve writers. Figure 7 shows the results
for block sizes between 256 kB and 64 MB. For writes greater than
1 MB, WTF achieves 97% of the throughput of HDFS. For 256 kB
writes, WTF achieves 84% of the throughput of HDFS.
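A block-size sweep of this kind can be sketched against a local file as follows. This is not the benchmark harness used in the paper: it runs a single writer against the local filesystem, the function names are hypothetical, and at these small sizes it mostly measures the page cache rather than the disk.

```python
import os
import tempfile
import time


def sequential_write_throughput(path, total_bytes, block_size):
    """Write total_bytes using fixed-size write calls; return bytes per second."""
    block = b"\0" * block_size
    calls = total_bytes // block_size
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(calls):
            f.write(block)
            f.flush()  # make each write visible before issuing the next
    elapsed = time.perf_counter() - start
    return calls * block_size / max(elapsed, 1e-9)


def sweep(directory, total_bytes, block_sizes):
    """Measure throughput once per block size, each into a fresh file."""
    results = {}
    for size in block_sizes:
        path = os.path.join(directory, f"seq-{size}.dat")
        results[size] = sequential_write_throughput(path, total_bytes, size)
    return results


KB, MB = 1024, 1024 * 1024
workdir = tempfile.mkdtemp()
results = sweep(workdir, 4 * MB, [256 * KB, 1 * MB, 4 * MB])
```

A faithful reproduction would run one such writer per storage server concurrently and sum the per-writer throughput, as the text describes.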
Figure 7: Aggregate throughput of a sequential write workload
where writers make fixed-size calls to “write”. HDFS and WTF both
provide applications with approximately 400 MB/s of goodput. Error
bars report the standard error of the mean across seven trials.





Figure 8: Median latency of write operations across a variety 
of write sizes. Error bars report the 5th and 95th percentile 
latencies. 

The latency for the two systems is similar, and directly correlated
with the block size. Figure 8 shows the latency of writes across a
variety of block sizes. We can see that WTF's median latency is
very close to HDFS's median latency for larger writes, and that the
95th percentile latency for WTF is often lower than on HDFS
operations. Latency of WTF write operations diverges from HDFS for
256 kB writes. Each HyperDex transaction in WTF imposes an
approximately 3 ms lower bound on the total write completion time.
For the 256 kB test case, this is 50% of the median latency. Even
so, WTF's median and 95th percentile latency measurements for this
block size are only 2 ms higher than the corresponding measurements
for HDFS.
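The latency figures discussed above are order statistics over per-operation timings. The sketch below shows one common way to compute them (the nearest-rank method, which is an assumption; the paper does not state which estimator it uses) and the share of the median taken by a fixed per-transaction floor. The 6 ms median used in the example is only what the stated 3 ms and 50% imply, not a separately reported number.

```python
def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list, for an integer p in 1..100."""
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) using integers
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms):
    """Return the 5th, 50th (median), and 95th percentile latencies."""
    return {p: percentile(latencies_ms, p) for p in (5, 50, 95)}


def floor_share(floor_ms, median_ms):
    """Fraction of the median latency accounted for by a fixed floor."""
    return floor_ms / median_ms


summary = summarize(list(range(1, 101)))  # synthetic samples: 1..100 ms
share = floor_share(3.0, 6.0)
print(summary, share)  # {5: 5, 50: 50, 95: 95} 0.5
```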

Random Writes. WTF enables applications to write at random offsets
in a file without restriction. Because HDFS cannot support
applications that write at random offsets within a file, we cannot
use it as a baseline for these experiments; instead, the sequential
write performance of WTF will serve as a baseline to compare
against the random write performance. This benchmark issues writes
at uniformly random offsets instead of sequentially increasing
offsets.

Figure 9: Aggregate throughput of concurrent writers making
fixed-size calls to “write” at random offsets within a file. Error
bars report the standard error of the mean across seven trials.

Figure 9 shows the aggregate throughput achieved by clients
randomly writing to WTF. We can see that the random write
throughput is always within a factor of two of the sequential
throughput, and that this difference diminishes as the size of the
writes approaches 8 MB.
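A random-write workload of the kind described above can be sketched as follows, again against a local file rather than WTF. The helper names are hypothetical, and aligning offsets to the block size is an assumption of this sketch; the text only states that offsets are uniformly random.

```python
import os
import random
import tempfile


def random_offsets(file_size, block_size, count, seed=0):
    """Draw block-aligned offsets uniformly at random within the file."""
    rng = random.Random(seed)
    slots = file_size // block_size
    return [rng.randrange(slots) * block_size for _ in range(count)]


def write_at_offsets(path, offsets, block):
    """Overwrite the file in place with one block at each offset."""
    with open(path, "r+b") as f:
        for offset in offsets:
            f.seek(offset)
            f.write(block)


FILE_SIZE, BLOCK_SIZE = 1024 * 1024, 4096
target = os.path.join(tempfile.mkdtemp(), "random.dat")
with open(target, "wb") as f:
    f.write(b"\0" * FILE_SIZE)  # preallocate so every offset is valid

offsets = random_offsets(FILE_SIZE, BLOCK_SIZE, count=32)
write_at_offsets(target, offsets, b"x" * BLOCK_SIZE)
```

Switching the offset generator to `range(0, FILE_SIZE, BLOCK_SIZE)` gives the sequential baseline against which the random pattern is compared.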

Because the common case for a sequential write and a random write
in WTF differ only at the stage where metadata is written to
HyperDex, we expect that such a difference in throughput is
directly attributable to the metadata stage. HyperDex provides
lower latency variance to applications with a small working set
than to applications with a large working set with no locality of
access. We can see the difference this makes in the tail latency of
WTF writes in Figure 10, which shows the median and 99th percentile
latencies for both the sequential and random workloads. The median
latency for both workloads is the same for all block sizes. For
block sizes 4 MB and larger, the 99th percentile latencies are
approximately the same as well. Writes less than 4 MB in size
exhibit a significant difference in 99th percentile latency between
the sequential and random workloads. These smaller writes spend
more time updating HyperDex than writing to storage servers. We
expect that further optimization of HyperDex would close the gap
between sequential and random write performance.

Although the difference between sequential and random performance
is significant, it is important to remember that HDFS applications
cannot perform random writes at all. With HDFS, applications that
need to change a file must rewrite the file in its entirety, which
is a costly and slow process.

Figure 10: Median and 99th percentile latencies for sequential and
random WTF writes. The median latency does not change between
sequential and random write patterns.

Figure 11: Aggregate throughput of concurrent readers reading
fixed-size blocks. HDFS and WTF both achieve approximately
900 MB/s of read throughput. Error bars report the standard error
of the mean across seven trials.

Sequential Reads. Batch processing applications often read large
input files sequentially during both the map and reduce phases.
Although a properly written application will double-buffer to avoid
small reads, the filesystem should not rely on such behavior to
enable high throughput. This experiment shows the extent to which
WTF can be used by batch applications by reading through a file
sequentially using a fixed-size buffer.

Figure 11 shows the aggregate throughput of concurrent readers
reading through a file written by the previously described
sequential write benchmark. We can see that for all read sizes,
WTF's throughput is at least 80% of the throughput of HDFS. The
throughput reported here is not comparable to the throughput
reported in the write benchmark because only one of the two active
replicas is consulted on each read, thus doubling the number of
disks available for independent operations. For smaller reads,
WTF's throughput matches that of HDFS. The difference at larger
sizes is largely an artifact of the implementations. HDFS uses
readahead on both the clients and storage servers in order to
improve throughput for streaming workloads. By default and in the
experiment, the HDFS readahead is configured to be 4 MB, which is
the point at which the systems start to exhibit different
characteristics. Our preliminary WTF implementation does not have
any readahead mechanism, and exhibits higher latency. A more mature
implementation could take advantage of readahead to reduce this
difference.

Random Reads. Applications built on a distributed filesystem, such
as key-value stores or record-oriented applications, often require
random access to the files. This experiment shows the performance
of applications reading constant-sized pieces from a file at
offsets that are chosen uniformly at random.

Figure 12: Aggregate throughput of random reads of varying size in
a two-replicated deployment. We can see that WTF-backed
applications achieve higher throughput than HDFS applications for a
variety of small read sizes. Error bars indicate the standard error
of the mean across seven trials.

Figure 12 shows the aggregate throughput of twelve concurrent
random readers. We can see that for reads of less than 16 MB, WTF
achieves significantly higher throughput; at its peak, WTF's
throughput is 2.4x the throughput of HDFS. Here, the readahead and
client-side caching that helps HDFS with larger sequential read
workloads adds overhead to HDFS that WTF does not incur. The 95th
percentile latency of a WTF read is less than the median latency of
an HDFS read for block sizes less than 4 MB.

Scaling the Workload. This experiment varies the number of clients
writing to the filesystem to explore how concurrency affects both
latency and throughput. This benchmark employs the workload from
the sequential-write benchmark with a 4 MB write size and a
variable number of workload-generating clients.

Figure 13 shows the resulting throughput for between one and twelve
clients. We can see that the single client performance is
approximately 60 MB/s, while twelve clients sustain an aggregate
throughput of approximately 380 MB/s. WTF's throughput is
approximately the same as the throughput of HDFS for each data
point. Running the same workload with forty-eight clients did not
increase the throughput beyond the throughput achieved with twelve
clients. We can see the corresponding latency change in Figure 14.

Figure 13: Aggregate throughput as the number of writers increases.
Error bars show the standard error of the mean across seven trials.

Figure 14: Median write latency as the number of writers increases.
Error bars show the 5th and 95th percentile latencies.

Garbage Collection. This benchmark measures the overhead of garbage
collection on a storage server. As mentioned in Section 2.8, it is
more efficient to collect files with more garbage than files with
less garbage, and WTF preferentially garbage collects these larger
files. Figure 15 shows the rate at which the cluster can collect
garbage, for varying amounts of randomly located garbage, when all
resources are dedicated to the task. We can see that when the
cluster consists of 90% garbage, the cluster can reclaim this
garbage at a rate of over 9 GB of garbage per second, because it
need only write 1 GB/s to reclaim the garbage.
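The relationship between write bandwidth and reclaim rate can be made explicit with a simple model. The sketch assumes that collecting a file means rewriting only its live bytes, so each live byte written frees g / (1 - g) bytes of garbage, where g is the garbage fraction. This model and the name reclaim_rate are assumptions chosen to match the 90% example above, not a description of WTF's collector.

```python
def reclaim_rate(write_gb_per_s, garbage_fraction):
    """Garbage reclaimed per second when only live data must be rewritten."""
    if not 0.0 <= garbage_fraction < 1.0:
        raise ValueError("garbage_fraction must be in [0, 1)")
    live_fraction = 1.0 - garbage_fraction
    return write_gb_per_s * garbage_fraction / live_fraction


rates = {g: reclaim_rate(1.0, g) for g in (0.25, 0.5, 0.9)}
# With 1 GB/s of write bandwidth: about 0.33, 1.0, and 9.0 GB/s reclaimed.
```

Under this model the reclaim rate grows rapidly as the garbage fraction rises, which is consistent with collecting the most garbage-laden files first.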

It is, however, impractical to dedicate all resources to garbage
collection; instead, WTF dedicates only a fraction of I/O to the
task. Storage servers initiate garbage collection when disk usage
exceeds a configurable threshold, and cease when the amount of
garbage drops below 20%. Figure 15 shows that the maximum overhead
required to maintain the system below this threshold is 4%.
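The start and stop conditions described above form a simple hysteresis loop, sketched below. The class name, the update interface, and the 80% start threshold used in the example are hypothetical; the text specifies only that the start threshold is configurable and that collection stops once garbage falls below 20%.

```python
class GcController:
    """Decide whether a storage server should currently be collecting garbage."""

    def __init__(self, start_usage, stop_garbage=0.20):
        self.start_usage = start_usage    # configurable disk-usage trigger
        self.stop_garbage = stop_garbage  # stop once garbage drops below this
        self.active = False

    def update(self, disk_usage, garbage_fraction):
        """Feed one observation; return True while collection should run."""
        if not self.active and disk_usage > self.start_usage:
            self.active = True
        elif self.active and garbage_fraction < self.stop_garbage:
            self.active = False
        return self.active


controller = GcController(start_usage=0.80)
trace = [(0.70, 0.30), (0.85, 0.40), (0.75, 0.25), (0.60, 0.15), (0.65, 0.18)]
states = [controller.update(usage, garbage) for usage, garbage in trace]
print(states)  # [False, True, True, False, False]
```

Using separate start and stop conditions keeps the collector from toggling on every small change in disk usage.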

5 Related Work 

Filesystems have been an active research topic since the earliest
days of systems research. Existing approaches related to WTF can be
broadly classified into two categories based upon their design.


Figure 15: The maximum rate of garbage collection is positively
correlated with the amount of garbage to be collected.
Consequently, WTF dedicates a small fraction of its overall I/O to
garbage collection.


Distributed filesystems. Distributed filesystems expose one or more
units of storage over a network to clients. AFS exports a uniform
namespace to workstations, and stores all data on centralized
servers. Other systems, most notably xFS and Swift, stripe data
across multiple servers for higher performance than can be achieved
with a single disk. Petal provides a virtual disk abstraction that
clients may use as a traditional block device. Frangipani builds a
filesystem abstraction on top of Petal. NASD and Panasas [43]
employ customized storage devices that attach to the network to
store the bulk of the metadata. In contrast to these systems, WTF
provides transactional guarantees that can span hundreds or
thousands of disks because its metadata storage scales
independently of the number of storage servers.

Recent work has focused on building large-scale datacenter-centric
filesystems. GFS [17] and HDFS employ a centralized master server
that maintains the metadata, mediates client access, and
coordinates the storage servers. Salus [42] improves HDFS to
support storage and computation failures without loss of data, but
retains the central metadata server. This centralized master
approach, however, suffers from scalability bottlenecks inherent to
the limits of a single server [27]. WTF overcomes the metadata
scalability bottleneck using the scalable HyperDex key-value store.

CalvinFS focuses on fast metadata management using distributed
transactions in the Calvin transaction processing system.
Transactions in CalvinFS are limited, and cannot do
read-modify-write operations on the filesystem without an
additional mechanism. Further, CalvinFS addresses file
fragmentation using a heavy-weight garbage collection mechanism
that entirely rewrites fragmented files; in the worst case, a
sequential writer could incur I/O that scales quadratically in the
size of the file. In contrast, WTF provides fully general
transactions and carefully arranges data to improve sequential
write performance.

Another approach to scalability is demonstrated by Flat Datacenter
Storage (FDS), which enables applications to access any disk in a
cluster via a CLOS network with full bisection bandwidth. To
eliminate the scalability bottlenecks inherent to a single master
design, FDS stores metadata on its tract servers and uses a
centralized master solely to maintain the list of servers in the
system. Blizzard [28] builds block storage, visible to applications
as a standard block device, on top of FDS, using nested striping
and eventual durability to service the smaller writes typical of
POSIX applications. These systems are complementary to WTF, and
could implement the storage server abstraction.

Power-proportional filesystems are elastic, in that they
dynamically change the power consumption of a cluster to scale
resource usage with demand and decrease power consumption in the
cluster. WTF's design does not consider power-proportionality, but
could possibly incorporate allocation techniques from other systems
to make it more elastic.

Other “blob” storage systems behave similarly to file systems, but
with a restricted interface that permits creating, retrieving, and
deleting blobs, without efficient support for arbitrarily changing
or resizing blobs. Facebook's f4 ensures infrequently accessed
files are readily available for access. Pelican enables
power-efficient cold storage by overprovisioning storage space, and
selectively turning on subsets of the disks to service requests.
The design goals of these systems are different from the
interactive, online applications that WTF enables; WTF could be
used in front of these systems to generate, maintain, and modify
data before placing it in warm or cold storage.


Transactional filesystems. Transactional filesystems enable
applications to offload much of the hard work relating to update
consistency and durability to the filesystem. The Quicksilver
operating system shows that transactions across the filesystem
simplify application development. Further work showed that
transactions could be easily added to LFS, exploiting properties of
the already-log-structured data to simplify the design. Valor [36]
builds transaction support into the Linux kernel by interposing a
lock manager between the kernel's VFS calls and existing VFS
implementations. In contrast to the transactions provided by WTF,
and the underlying HyperDex transactions, these systems adopt
traditional pessimistic locking techniques that hinder concurrency.

Optimistic concurrency control schemes often enable more
concurrency for lightly contended workloads. PerDiS FS adopts an
optimistic concurrency control scheme that relies upon external
components to reconcile concurrent changes to a file. This allows
users and applications to concurrently work on the same file;
according to the authors, the most commonly adopted technique is
selecting one version and throwing the rest away. Liskov and
Rodrigues show that much of the overhead of a serializable
filesystem can be avoided by running read-only transactions in the
recent past, and employing an optimistic protocol for read-write
transactions. WTF builds on top of HyperDex's optimistic
concurrency and supports operations such as append that avoid
creating conflicts between concurrent transactions.
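The contrast between a conflicting overwrite and a non-conflicting append can be illustrated with a generic version-checking store. This is a textbook-style sketch of optimistic concurrency control, not HyperDex's protocol or WTF's implementation; every name in it is hypothetical.

```python
class Conflict(Exception):
    """Raised when a commit is based on a stale version."""


class VersionedStore:
    def __init__(self):
        self._items = {}  # key -> (version, value)

    def read(self, key):
        return self._items.get(key, (0, b""))

    def commit_write(self, key, read_version, value):
        """Overwrite succeeds only if nobody committed since read_version."""
        current, _ = self.read(key)
        if current != read_version:
            raise Conflict(key)
        self._items[key] = (current + 1, value)

    def commit_append(self, key, suffix):
        """Append does not depend on a previously read version, so two
        appenders never invalidate each other; they are merely ordered."""
        current, value = self.read(key)
        self._items[key] = (current + 1, value + suffix)


store = VersionedStore()
version_a, _ = store.read("log")  # two clients read the same version
version_b, _ = store.read("log")
store.commit_write("log", version_a, b"A")
try:
    store.commit_write("log", version_b, b"B")
    second_write_committed = True
except Conflict:
    second_write_committed = False  # stale read: the client must retry

store.commit_append("log", b"1")
store.commit_append("log", b"2")
```

The second overwrite is rejected because its read is stale, while both appends commit without either client having to retry.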

WTF is not the first system to choose to employ a transactional
database as part of its design. Inversion builds on PostgreSQL to
maintain a complete filesystem. KBDBFS and Amino both build on top
of BerkeleyDB; the former is an in-kernel implementation of
BerkeleyDB, while the latter eschews the complexity and takes a
performance hit with a userspace implementation. WTF differs from
these designs in that it stores solely the metadata in the
transactional data store; data is stored elsewhere and not managed
by the transactional component. Further, its design ensures that
transactions on metadata are sufficient to provide filesystem-level
transactions.

Stasis [34] makes the argument that no one design supports all use
cases, and that transactional components should be building blocks
for applications. WTF's approach is similar; HyperDex's
transactions are used as a base primitive for managing WTF's state,
and WTF supports a transactional API. Applications built on WTF can
use this API to achieve their own transactional behavior.


6 Conclusion 

This paper described the Wave Transactional Filesystem (WTF), a new
distributed filesystem that enables applications to operate on
multiple files transactionally without requiring complex
application logic. A new filesystem abstraction called slicing
further boosts performance by modifying files more efficiently than
traditional primitives permit. The main insight behind file slicing
is that it enables applications to read and write using references
to data that is stored elsewhere in the filesystem.
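The idea of building files from references rather than copies can be sketched with a small in-memory model. The operation names yank, paste, and concat follow the terms used in Section 4.1, but the signatures, the Slice record, and the blob dictionary below are assumptions of this sketch and do not reflect WTF's actual API or storage layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slice:
    blob: str    # identifier of data already stored somewhere
    offset: int  # starting byte within that blob
    length: int  # number of bytes referenced


class SlicedFile:
    """A file represented purely as an ordered list of slice references."""

    def __init__(self, slices=()):
        self.slices = list(slices)

    def yank(self, offset, length):
        """Return references covering [offset, offset + length); copies no data."""
        out, pos, end = [], 0, offset + length
        for s in self.slices:
            lo, hi = max(offset, pos), min(end, pos + s.length)
            if lo < hi:
                out.append(Slice(s.blob, s.offset + (lo - pos), hi - lo))
            pos += s.length
        return out

    def paste(self, slices):
        """Append previously yanked references to this file."""
        self.slices.extend(slices)

    def read(self, blobs):
        """Materialize the bytes, for demonstration only."""
        return b"".join(blobs[s.blob][s.offset:s.offset + s.length]
                        for s in self.slices)


def concat(files):
    """Build a new file from the slice lists of existing files."""
    return SlicedFile([s for f in files for s in f.slices])


blobs = {"a": b"hello world"}
source = SlicedFile([Slice("a", 0, 11)])
rearranged = SlicedFile()
rearranged.paste(source.yank(6, 5))  # "world"
rearranged.paste(source.yank(0, 5))  # "hello"
joined = concat([rearranged, source])
```

Only slice metadata changes during yank, paste, and concat; the stored bytes in `blobs` are never rewritten, which is the source of the I/O savings reported for the sorting benchmark.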

A broad evaluation shows that WTF achieves throughput and latency
similar to industry-standard HDFS, while simultaneously offering
stronger guarantees and a richer API. A sample application built
with file slicing outperforms traditional approaches by a factor of
four by reducing the overall I/O cost.

The ability to make transactional changes to multiple files at
scale is novel in the distributed systems space, and the file
slicing APIs enable a new class of applications that are difficult
to implement efficiently with current APIs. Together, these
features are a potent combination that enables a new class of
high-performance applications.